Please, read my outlines for the data visualization course in the the follow README file here and delves into learning about data visualization.

Preface

When I started coding for biology I realize on this amazing challenge about how to tell the history from a bunch of Next Generation sequencing datasets. Visualization of information (from massive data mining in special) become in a nice part of my data scientist training.

Trainning dataset

We will use high throughput sequencing dataset from a marine non model organism exposed to hidrocarbon polutant. The libraries sequenced were acqured from undifferented (Un) and sexual differented stages ( Male and female). In this experiment time 0 and three corresponded to hidrocarbon pollutant expossition (before and after, respectivily).

Libraries were preprocessing using the standar parameters within trimmomatic and the assembly were performed with trinity. Differential gene expression and annotation were perform with edgeR (an R package) and Trinotate, respectivily.

The chunk-code here correspond to my workflow used in the lab. Hope you enjoy it!

From your work directory clone the follow repo git clone https://github.com/RJEGR/July_2018_bioinfo.git. Example:

mkdir July_2018_bioinfo
cd July_2018_bioinfo
git clone https://github.com/RJEGR/July_2018_bioinfo.git

Required package:

# library(devtools)
# devtools::install_github("rlbarter/superheat")
# devtools::install_github("slowkow/ggrepel")
# install_github("cstubben/trinotateR")
# devtools::install_github("ropensci/taxa")
# devtools::install_github("grunwaldlab/metacoder")
# if (!require("DT")) install.packages('DT')
# source("https://bioconductor.org/biocLite.R"); biocLite("org.Hs.eg.db"); # (~ 74.3 MB)
# biocLite("ggtree")
# biocLite('phyloseq')
# if(!require(devtools)) install.packages("devtools")
# devtools::install_github("kassambara/ggpubr")
# 

Loading input files

dir <- c("/Users/cigom/Documents/GitHub/July_2018_bioinfo/infile/")
# setwd(dir)
x <- read.table(paste0(dir,"diffExpr.P1e-2_C1.matrix_control"))

And read the input file; then let’s make a head(x) from the file:

T0F3 T0F4 T0M1 T0M4 T0UR T0U
TRINITY_DN29981_c0_g1_i1 158.162 152.974 0.537 0.000 0.000 0.000
TRINITY_DN25178_c2_g1_i1 0.337 0.181 167.292 100.095 140.610 925.271
TRINITY_DN23469_c2_g2_i1 0.000 0.000 24.217 0.000 312.915 3789.992
TRINITY_DN27384_c1_g1_i1 54.064 150.841 0.000 0.000 0.000 0.095
TRINITY_DN32136_c1_g1_i2 0.000 0.000 34.300 3.078 48.656 93.654
TRINITY_DN24700_c1_g1_i3 0.000 0.000 2.190 0.000 50.995 2477.124

Let’s log2 transform the data. Why to use the log2 transformation of the normalized count table ? (please read an answer on a researchgate question here)

data = x # restore before doing various data transformations
data = log2(data+1)
data = as.matrix(data) # convert to matrix
data = t(scale(t(data), scale=F)) # Centering rows

Heatmap from Differential gene Expression

Using superheat

library(superheat)
superheat(data,
          # retain original order of rows/cols
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE,
          row.dendrogram = TRUE,
          left.label = "none",
          bottom.label.text.angle = 0,
          row.title = "Differential Expressed",
          column.title = "Samples")
example 1

example 1

Also use the option membership.cols to group the superheat accordint to any factor, for example: factor <- c(rep("Female",2), rep("Male",2), rep("Undiff",2))

factor <- c(rep("Female",2), rep("Male",2), rep("Undiff",2))
superheat(data,
          # retain original order of rows/cols
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE,
          row.dendrogram = TRUE,
          left.label = "none",
          row.title = "Differential Expressed",
          column.title = "Factor", 
          membership.cols = factor)
example 2

example 2

Volcano plot

After process the differential gene expression analysis (Ex. running the run_DE_analysis.pl from Trinity framework) we can improve the data visualization as follow:

DE <- read.table(paste0(dir,"RSEM.isoform.counts.matrix.Female_vs_Undiff.edgeR.DE_results"))
library(ggplot2)
library(scales)

p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) + geom_point()
p
This is a Volcano plot

This is a Volcano plot

Let’s label by Fold Change (up/down) rate and significancy by the follow condition:

  • log Fold Change (logFC) > abs(2)
  • False Discovery Rate (FDR) < 0.05
  • Both statements: FDR < 0.05 and logFC > abs(2)
# logf>abs(2) fdr < 0.05 fdr < 0.05 and logfc> abs(2)
DE$Sig <- "Non Sig or basal"
DE$Sig[(abs(DE$logFC) > 2)] <- "logFC"
DE$Sig[(DE$FDR<0.05) & (abs(DE$logFC)>2)] <- "logFC_FDR"

p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) +
    geom_point(aes(color = Sig)) + theme_classic() +
    scale_color_brewer()
p
Volcano; Color and fill

Volcano; Color and fill

Add lines and axis name:

p1 <- p +
        geom_vline(xintercept = 0) +
        geom_hline(yintercept = 0) +
        geom_hline(yintercept = -log10(0.0001), linetype = "dashed") +
        geom_vline(xintercept = c(-2, 2), linetype = "dashed")

And also let’s rename x and y axis using backquote macros:

p1 + xlab(bquote(~Log[2]~ "fold change")) + ylab(bquote(~-Log[10]~italic(P)))
Volcano lines and axis labeled

Volcano lines and axis labeled

# p2 <- p + xlab("Fold change (log2)") + ylab("P-Value")

Finally, let’s label the dots from the scatter:

library(ggrepel)
maxlab <- max(-log10(DE$PValue)) - 1 # select the points below the highest -log10(PValue) value to label

p2 <-  p1 + geom_text_repel(
          data = subset(DE, -log10(PValue) > maxlab),
          aes(label = Sig),
          size = 2.5,
          box.padding = unit(0.2, "lines"),
          point.padding = unit(0.2, "lines") 
  )

Finally, let’s compare the layouts:

library(gridExtra)
library(grid)
grid_arrange_shared_legend(p,p1,p2, ncol = 3)
Volcano comparison

Volcano comparison

Functional and enrichment annotation

After assembly denovo or guided transcriptome is common to map each contig to a reference in order to annotate the potential source of each transcript. In this view, blast2go is the average tool used by users to perform this analysis nevertheles blast2go is a non free tool. In spite of, Trinotate is a useful free-framework to this purpose. Trinotate makes use of a number of different well referenced methods for functional annotation including homology search to known sequence data (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (signalP/tmHMM), and leveraging various annotation databases (eggNOG/GO/Kegg databases).

In personal experience, trinotate had resulted in better performance (such less computational time) than the blast2go counterpart and useful to automatically annotate one workflow at time.

In addition, trinotateR is package developed by Chris Stubben with useful functions to “wrangling” the tab-delimited Trinotate.xls result.

library(trinotateR)
## Loading required package: data.table
y <- read_trinotate(paste0(dir,"Trinotate.xls"))
knitr::kable(summary_trinotate(y))
unique total
gene_id 61694 148534
transcript_id 147454 148534
prot_id 55785 55785
prot_coords 24521 55785
sprot_Top_BLASTX_hit 43800 45692
gene_ontology_blast 14000 44709
Kegg 15112 39430
eggnog 5651 37535
sprot_Top_BLASTP_hit 28519 35660
Pfam 26217 34005
gene_ontology_pfam 1775 20111
TmHMM 7764 10393
RNAMMER 9 9
SignalP 0 0
transcript 0 0
peptide 0 0

Most of the annotations contain mutliple hits in a backtick-delimited list and each hit contains multiple fields in a caret-delimited list. For example, the second Pfam annotation below contains two hits and each hit contains a pfam id, symbol, name, alignment and e-value. The split_pfam functions splits multiple hits and fields, so the second Pfam annotation is now printed in rows 2 and 3 below.

y1 <- split_pfam(y)
## 70856 Pfam annotations
head(y1,3)[,c(2,4:7)]
##                  transcript    pfam         symbol
## 1: TRINITY_DN10004_c0_g1_i1 PF01442 Apolipoprotein
## 2: TRINITY_DN10005_c0_g1_i1 PF16489           GAIN
## 3: TRINITY_DN10007_c0_g1_i1 PF00135     COesterase
##                                           name  align
## 1:               Apolipoprotein A1/A4/E domain  20-67
## 2: GPCR-Autoproteolysis INducing (GAIN) domain 17-121
## 3:                     Carboxylesterase family   3-85

Finally, the summary_pfam lists both, the number of unique Pfam identifiers and the total genes, transcripts and proteins with a Pfam annotation.

y2 <- summary_pfam(y1)
## 4849 rows
knitr::kable(head(y2[order(-y2$transcripts),]))
pfam symbol name genes transcripts proteins total
PF00386 C1q C1q domain 201 801 809 856
PF00069 Pkinase Protein kinase domain 279 594 594 611
PF07714 Pkinase_Tyr Protein tyrosine kinase 272 583 583 596
PF00400 WD40 WD domain, G-beta repeat 203 461 461 1351
PF12796 Ank_2 Ankyrin repeats (3 copies) 219 455 455 925
PF00023 Ank Ankyrin repeat 214 449 449 1031

The summary table also includes a count attribute with the number of unique genes, transcripts and proteins with a Pfam annotation, as well as the total number of annotations. In this example, there are 33,721 unique transcripts with a Pfam annotation and 56,642 total annotations to transcripts (since those may have more than one Pfam annotation).

c <- attr(y2, "count")
print(c)
##             unique annotations
## Pfam          4849       70856
## genes        14417       25280
## transcripts  33721       56642
## proteins     34005       56757

Implement datatable to present interactive data-table: from all the available genes/transcripts annotations. The R package DT is an R interface to the JavaScript library DataTable. It package use R data objects (matrices or data frames) to display it as tables on HTML pages. This object-class could include filtering, pagination, sorting, and many other features in the tables.

z <- data.frame(y2)
z$pfam <- paste0('<a href="http://pfam.xfam.org/family/', z$pfam, '">', z$pfam, '</a>')
library(DT)
datatable(z , escape=1, options = list( pageLength = 10 ) )
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Using trinotateR to get the Gene Ontology annotation.

A GO annotation is a statement about the function of a particular gene. Each GO annotation consists of an association between a gene and a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. The GO describes function with respect to three aspects: molecular function (molecular-level activities performed by gene products), cellular component (the locations relative to cellular structures in which a gene product performs a function), and biological process (the larger processes, or ‘biological programs’ accomplished by multiple molecular activities)

Reference: http://www.geneontology.org/page/ontology-documentation, 2018

go <- split_GO(y)
## 473221 gene_ontology_blast annotations
gos <- summary_GO(go)
## 15386 rows
knitr::kable(head(gos[order(-gos$transcripts),]))
go ontology name genes transcripts proteins total
GO:0005737 cellular_component cytoplasm 5087 10521 8155 10550
GO:0005634 cellular_component nucleus 4286 8687 6947 8725
GO:0016021 cellular_component integral component of membrane 3839 7817 6020 7832
GO:0005829 cellular_component cytosol 3508 7300 5854 7325
GO:0005886 cellular_component plasma membrane 3366 6305 4961 6315
GO:0046872 molecular_function metal ion binding 2771 5815 4421 5829

And also determine the counts

cgo <- attr(gos, "count")
print(cgo)
##             unique annotations
## GO           15386      473221
## genes        19818      241902
## transcripts  44415      472268
## proteins     34463      379068

Using only the translated transcripts to get the differential Expressed annotations:

got <- na.omit(go, cols = "protein")
x$transcript <- rownames(x)
m <- merge(got, x, suffix = c("transcript"), all=FALSE)
m <- m[order(m$transcript),c(1,4,5:12)]

library(DT)
datatable(m , filter = 'top', escape=1, options = list( pageLength = 10 ), 
          rownames = FALSE)

Let’s draw some visualizations from the annotation enrichment throught a the transcriptome assembly using the package ggpubr:

library(ggpubr)
## Loading required package: magrittr
plotgos <- head(gos[order(-gos$transcripts),], 80)
ggbarplot(plotgos, "name", "transcripts",
          fill = "ontology", 
          color = "ontology",
   palette = c("#00AFBB", "#E7B800", "#FC4E07")) +
   theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7))
Gene Ontology enrichment plot (top 80 transcripts)

Gene Ontology enrichment plot (top 80 transcripts)

   # facet_grid(ontology ~ .,) + theme(#strip.text.x = element_blank(), 
   #                              axis.text.x = element_text(angle = 90, hjust = 1))

And separate the ontology annotation for further analysis:

Var1 Freq
biological_process 10236
cellular_component 1571
molecular_function 3579

You can separete the go terms to perfom further test:

MF <- gos[gos$ontology=="molecular_function",]
CC <- gos[gos$ontology=="cellular_component",]
BP <- gos[gos$ontology=="biological_process",]

Semantic similarity (of GO terms)

Determine the similarity of two GO terms based on the annotation statistics of their common ancestor terms by computing semantic similarity among GO terms, sets of GO terms, gene products, and gene clusters, providing different methods than measure the information content (IC). To details please read the Wang’s paper published in Oxford.

First, build annotation data needed by GOSemSim via godata function. Based in figure from Gene Ontology Enrichment, we could focus on the Cellular component (CC) terms.

library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:data.table':
## 
##     first, second
## The following object is masked from 'package:base':
## 
##     expand.grid
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:data.table':
## 
##     shift
## 
hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="CC", computeIC=FALSE)
## 
## preparing gene to GO mapping data...
go <- as.vector(CC$go)
go1 <- sample(go, 20)
go2 <- sample(go, 20)
gosim <- GOSemSim::mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)

And visualize the similarity of the GO term.

superheat(gosim,
          # retain original order of rows/cols
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE,
          #left.label = "none",
          bottom.label.text.angle = 90,
          row.title = "Sample 1",
          column.title = "Sample 2",
          bottom.label.text.size = 4,
          left.label.text.size = 4
          )

Finally detache (unload) org.Hs.eg.db package.

detach(package:org.Hs.eg.db, unload = TRUE)

Blast hits annotation

From the annotation, lets use the blastx result in order to get the lineage from proteins anntotated using a modified function from trinotateR

split_blast2 <- function (x, hit = "sprot_Top_BLASTX_hit") 
{
  y <- x[!is.na(get(hit)), .(get(hit), gene_id, transcript_id, 
                             prot_id)]
  z <- strsplit(y$V1, "`")
  n <- sapply(z, length)
  z <- strsplit(unlist(z), "\\^")
  if (any(sapply(z, "[", 1) != sapply(z, "[", 2))) 
    print("WARNING: check different values in columns 1 and 2")
  NAME <- gsub("^RecName: Full=", "", sapply(z, "[", 6))
  NAME <- gsub("SubName: Full=", "", NAME)
  NAME <- gsub(";$", "", NAME)
  NAME <- gsub(" \\{[^}]+}", "", NAME)
  x1 <- data.frame(gene = rep(y$gene_id, n), transcript = rep(y$transcript_id, 
                                                              n), protein = rep(gsub(".*\\|", "", y$prot_id), n), uniprot = sapply(z, 
                                                                                                                                   "[", 1), align = sapply(z, "[", 3), identity = as.numeric(gsub("%ID", 
                                                                                                                                                                                                  "", sapply(z, "[", 4))), evalue = as.numeric(gsub("E:", 
                                                                                                                                                                                                                                                    "", sapply(z, "[", 5))), name = NAME, lineage = sapply(z, 
                                                                                                                                                                                                                                                                                                           "[", 7), domain = gsub("; .*", "", sapply(z, "[", 7)), 
                   genus = gsub(".*; ", "", sapply(z, "[", 7)), stringsAsFactors = FALSE)
  message(nrow(x1), " ", hit, " annotations")
  data.table(x1)
}
blast <- split_blast2(y)
## 63540 sprot_Top_BLASTX_hit annotations
blast2 <- summary_blast(blast)
## 17039 rows
knitr::kable(head(blast2[order(-blast2$transcripts),]))
uniprot domain genus name genes transcripts proteins total
HS12B_HUMAN Eukaryota Homo Heat shock 70 kDa protein 12B 60 136 121 146
C1QL4_MOUSE Eukaryota Mus Complement C1q-like protein 4 47 122 106 124
HS12A_MOUSE Eukaryota Mus Heat shock 70 kDa protein 12A 51 113 90 116
HS12A_HUMAN Eukaryota Homo Heat shock 70 kDa protein 12A 53 99 64 99
PLCL_MYTGA Eukaryota Mytilus Perlucin-like protein 38 98 69 98
PLC_HALLA Eukaryota Haliotis Perlucin 34 97 81 100

Do you think we can re-annotate with the Gene ontology the Differential Expression data-set (DE object) and display a new annotated volcano plot as in the fist session ?

DE$transcript <- rownames(DE)
DEannot <- merge(DE, blast, suffix = c("transcript"), all=FALSE)


p1 + geom_text_repel(
          data = subset(DEannot, -log10(PValue) > 7),
          aes(label = name),
          size = 2.5,
          box.padding = unit(0.2, "lines"),
          point.padding = unit(0.2, "lines") 
  )
This is a Volcano plot annotated

This is a Volcano plot annotated

Color palette

Creates nice looking color palettes (discrete). Visit the follow cookbook to details

library(RColorBrewer)
display.brewer.all()

# display.brewer.pal(n = 5, name = 'Set3') # Hexadecimal color specification

Could we color with orange discrete scale the gosim object using superheat?. superheat could perform this by using the option heat.pal option.

superheat(gosim,
          # retain original order of rows/cols
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE,
          #left.label = "none",
          bottom.label.text.angle = 90,
          row.title = "Sample 1",
          column.title = "Sample 2",
          bottom.label.text.size = 4,
          left.label.text.size = 4,
          heat.pal = brewer.pal(n = 9, name = "Oranges"))

Taxonomic tree:

Visualizing hierarchical information

Metacoder.

Metacoder provides an alternative visulization we call heat trees to plot hierarchical information, in special the organism abundance data classified by a taxonomy.

Let’s subset the virus lineage annotation:

knitr::kable(table(blast$domain))
Var1 Freq
. 16
Archaea 137
Bacteria 1245
Eukaryota 61817
Viruses 325
vs <- blast[blast$domain == "Viruses",]
list <- vs$lineage

And plot …

library(metacoder)
## Loading required package: taxa
## 
## Attaching package: 'taxa'
## The following object is masked from 'package:ggplot2':
## 
##     map_data
## This is metacoder verison 0.2.1.9008 (development version). If you use metacoder for published research, please cite our paper:
## 
## Foster Z, Sharpton T and Grunwald N (2017). "Metacoder: An R package for visualization and manipulation of community taxonomic diversity data." PLOS Computational Biology, 13(2), pp. 1-15. doi: 10.1371/journal.pcbi.1005404
## 
## Enter `citation("metacoder")` for a BibTeX entry for this citation.
## 
## Attaching package: 'metacoder'
## The following object is masked from 'package:IRanges':
## 
##     reverse
library(RColorBrewer)
obj <- parse_tax_data(vs, class_cols = "lineage", class_sep = ";")
heat_tree(obj, 
          node_label = taxon_names,
          node_size = n_obs,
          node_color = n_obs,
          #edge_color = "grey",
          node_color_range = brewer.pal(n = 10, name = "Oranges"))
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
Tree example using metacoder

Tree example using metacoder

Which annotation could be true, let’s show the identity ?

heat_tree(obj, 
          edge_label = taxon_names,
          edge_label_size = 0.5,
          node_color = identity,
          edge_color = "grey",
          node_color_range = brewer.pal(n = 10, name = "Oranges"),
          node_size_axis_label = "n_obs",
          node_color_axis_label = "Identity")
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
## Warning: NAs found in the "node_color" option. These may cause plotting errors or other strange behavior. NAs can be created if there is not a value for each taxon. The following 28 of 48 taxa have NAs for the "node_color" option:
##   ab, ac, ad, ae, af, ag, ah, ai ... ba, bb, bh, bj, bm, bn, bo
Tree example 2; labeling by identity

Tree example 2; labeling by identity

ggtree

The ggtree is designed by extending the ggplot2 (Wickham 2009) package. It is based on the grammar of graphics and takes all the good parts of ggplot2. ggtree supports several layouts, including rectangular, slanted, circular and fan for phylogram and cladogram, equal_angle and daylight for unrooted layout, time-scaled and two dimentional phylogenies.

Reference: Yu et al. 2017, https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12628

tbl <- head(vs[order(-vs$identity),], 50)
tbl <- tbl[!duplicated(tbl$transcript),] # remove duplicates annotations
write.table(tbl$transcript, file = paste0(dir, "virus.list"), quote = FALSE, col.names = FALSE, row.names = FALSE)

Run in clustalo web-service and input again in r the tree this is a Neighbour-joining tree without distance corrections.

tree <- treeio::read.newick(paste0(dir,"virus.tree.corrected.txt"))

And plot

library(ggtree)
## ggtree v1.12.0  For help: https://guangchuangyu.github.io/software/ggtree
## 
## If you use ggtree in published research, please cite:
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628
## 
## Attaching package: 'ggtree'
## The following object is masked from 'package:IRanges':
## 
##     collapse
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## The following object is masked from 'package:ggpubr':
## 
##     rotate
## The following object is masked from 'package:magrittr':
## 
##     inset
multiplot(
        ggtree(tree, branch.length = "none") + geom_treescale(width=0.4),
        ggtree(tree, branch.length='none', layout="daylight") + geom_treescale(width=0.4) + geom_nodepoint(),
        ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4),
        ncol=3,labels = LETTERS[1:3])
## Average angle change [1] 0.099133807614635
## Average angle change [2] 0.0415093787405126
ggtree; using different layouts

ggtree; using different layouts

Then,

p <-  ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4) +
      geom_nodepoint(color="#b5e521", alpha=1/4, size=4) 

Modify tips and node geometry

multiplot(# add node points
        p + geom_nodepoint(),
        # add tip points
        p + geom_tippoint(),
        # Label the tips
        p + geom_tiplab(),
        ncol=3,labels = LETTERS[1:3])
ggtree; using different layouts

ggtree; using different layouts

Use annotation information to merge within object (ie. tree and dataframe)

d <- p + geom_nodepoint()

tbl <- tbl[,-c(1)]
colnames(tbl)[1] <- "label"
tbl$genus <- gsub("unclassified ", "", tbl$genus)


t1 <- d %<+% tbl + 
    geom_point(aes(color=genus), size=5, alpha=.5, na.rm = TRUE) +  theme(legend.position="bottom")

t2 <- p %<+% tbl + 
    geom_tiplab(size=2.5, aes(label=paste0('italic(', uniprot, ')')), parse=TRUE)

multiplot(t1, t2, ncol=2)
ggtree; labeling tree based on data-frame

ggtree; labeling tree based on data-frame

Improve your metagenomic visualization with phyloseq

This part of the course was cloned from a current version tutorial from https://joey711.github.io and is reproduced to demonstrate the simpleness of this tool

otus = matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10)
rownames(otus) <- paste0("OTU", 1:nrow(otus)) # name the rows with a label of name OTU#
colnames(otus) <- paste0("Sample", 1:ncol(otus)) # name the columns with a label of name Sample#
knitr::kable(otus)
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10
OTU1 69 29 17 22 93 8 39 3 36 87
OTU2 62 32 37 65 74 46 51 48 66 1
OTU3 25 4 85 35 98 8 93 94 67 37
OTU4 54 46 3 87 54 81 47 54 94 72
OTU5 62 90 67 40 60 77 89 27 53 44
OTU6 73 34 24 66 67 93 48 27 78 4
OTU7 49 93 22 78 11 38 93 14 90 83
OTU8 1 19 83 30 23 92 5 52 30 54
OTU9 51 24 55 70 60 49 97 86 26 12
OTU10 60 70 9 1 40 74 55 61 41 97

Let’s make a tax table

taxmat = matrix(sample(letters, 70, replace = TRUE), 
                nrow = nrow(otus), 
                ncol = 7)
rownames(taxmat) <- rownames(otus) # The rownames must match the OTU names (taxa_names) of the otus
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
knitr::kable(taxmat)
Domain Phylum Class Order Family Genus Species
OTU1 f v r p g z z
OTU2 d q g p x w c
OTU3 m a i v i l t
OTU4 m a b o u l q
OTU5 l u e b s x t
OTU6 m r x n x r m
OTU7 g w i k f q j
OTU8 c v j h m j r
OTU9 i u l n w x p
OTU10 w y o y m q q

Note than both, taxmat and otus, are objects of class matrix

class(taxmat)
## [1] "matrix"
class(otus)
## [1] "matrix"

….

library("phyloseq")
## 
## Attaching package: 'phyloseq'
## The following object is masked from 'package:taxa':
## 
##     filter_taxa
## The following object is masked from 'package:IRanges':
## 
##     distance
## The following object is masked from 'package:Biobase':
## 
##     sampleNames
OTU = otu_table(otus, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
physeq = phyloseq(OTU, TAX)

print(physeq)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10 taxa and 10 samples ]
## tax_table()   Taxonomy Table:    [ 10 taxa by 7 taxonomic ranks ]

Let’s visualize some barplots:

theme_set( theme_classic())
bar <- plot_bar(physeq, fill = "Genus") 

# Color with scale_fill_brewer

bar + geom_bar(stat="identity") +
      scale_fill_brewer(palette="Spectral")
phyloseq; barplot 1

phyloseq; barplot 1

Create random sample data, and add that to the combined dataset. Make sure that the sample names match the sample_names of the otu_table.

sampledata = data.frame(
  Location = sample(LETTERS[1:4], size=nsamples(physeq), replace=TRUE),
  Depth = sample(50:1000, size=nsamples(physeq), replace=TRUE),
  row.names=sample_names(physeq),
  stringsAsFactors=FALSE
  )

knitr::kable(sampledata)
Location Depth
Sample1 A 683
Sample2 C 251
Sample3 A 668
Sample4 A 755
Sample5 C 354
Sample6 C 986
Sample7 D 620
Sample8 B 182
Sample9 D 277
Sample10 A 56
sampledata <- sample_data(sampledata) # save as phyloseq-object

Finally, merge sampledata within phyloseq-object

physeq1 = merge_phyloseq(physeq, sampledata)
print(physeq1)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10 taxa and 10 samples ]
## sample_data() Sample Data:       [ 10 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 10 taxa by 7 taxonomic ranks ]

And improve the plots. Keep the same fill color, and group the samples together by the Location variable (essentially the environment from which the sample was taken and sequenced) described in the sample_data(phyloseq1):

plot_bar(physeq1, x="Location", fill="Genus") +
    geom_bar(stat="identity") +
      scale_fill_brewer(palette="Spectral")
phyloseq; barplot 2

phyloseq; barplot 2

Also we can organize the information according to any factor using facets. Let’s use the Location in order to group the plots, this time xlab will be labeled with Family name, and the fill of the bars will be colored by Genus level.

plot_bar(physeq1, "Family", fill="Genus", facet_grid=~Location) +
  geom_bar(stat="identity") +
      scale_fill_brewer(palette="Spectral")
barplot with facet_grid

barplot with facet_grid

You can improve you barpots as phyloseq authors sugguest here

Also we could implement tree information associated to the physeq1 object. Let’s create a random phylogenetic tree with the ape package, and add it to your dataset. Make sure its tip labels match your OTU_table.

library("ape")
## 
## Attaching package: 'ape'
## The following object is masked from 'package:ggtree':
## 
##     rotate
## The following object is masked from 'package:metacoder':
## 
##     complement
## The following object is masked from 'package:ggpubr':
## 
##     rotate
rtree = rtree(ntaxa(physeq1), rooted=TRUE, tip.label=taxa_names(physeq1))

Great, lets merge the tree in the physoleq- object

physeq1 = merge_phyloseq(physeq1, rtree)
physeq1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10 taxa and 10 samples ]
## sample_data() Sample Data:       [ 10 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 10 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 10 tips and 9 internal nodes ]

Also you can make a denovo phyloseq-object construction from the scratch by: physeq1 = phyloseq(OTU, TAX, sampledata, rtree).

plot_tree(physeq1, 
          color="Location", 
          label.tips="taxa_names", 
          size="abundance", 
          ladderize="left", 
          plot.margin=0.3)
phyloseq; tree with location labels

phyloseq; tree with location labels

plot_tree(physeq1, 
          color="Depth", 
          shape="Location", 
          label.tips="taxa_names", 
          ladderize="right", 
          plot.margin=0.3)
phyloseq; tree with location labels 2

phyloseq; tree with location labels 2

Heatmaps

plot_heatmap(physeq1)
plot heatmap from phyloseq-objects

plot heatmap from phyloseq-objects

Improve the heatmap by adding the taxa names in the ylab and order the xlab by the ecological distance using an ordination method.

h <- plot_heatmap(physeq1, low="#66CCFF", high="#000033",
            taxa.label="Family",
            method = "NMDS", distance = "bray"
  )
print(h)
plot heatmap from phyloseq-objects using ordination and taxa names

plot heatmap from phyloseq-objects using ordination and taxa names

We can use the ggplot2 functions to make a final arragement of the heatmap

h + facet_grid(Class ~ . + Location, scales = "free", space = "free") 

and dotplot

dp <- phyloseq::psmelt(physeq1)

library(ggpubr)
ggdotchart(dp, x = "Sample", y ="Abundance",
  ggtheme = theme_bw())

and

ggdotchart(dp, x = "Sample", y ="Abundance",
 group = "Location", color = "Domain",
   palette = "jco",
   rotate = TRUE,
   sorting = "descending",
   ggtheme = theme_bw(),
   y.text.col = TRUE,
   dot.size = "Abundance")

Final words

Improve the visualization of libraries quality using multiqc.

See the multiqc_report.html in the infile folder. If you’re interested in do this in-lab-plots, please visit the brief pre-processing step documentation.

This tutorial were performed using the R session:

print(sessionInfo())
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] parallel  stats4    grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] ape_5.1              phyloseq_1.24.2      ggtree_1.12.0       
##  [4] metacoder_0.2.1.9008 taxa_0.2.1.9009      RColorBrewer_1.1-2  
##  [7] AnnotationDbi_1.42.1 IRanges_2.14.10      S4Vectors_0.18.3    
## [10] Biobase_2.40.0       BiocGenerics_0.26.0  ggpubr_0.1.6        
## [13] magrittr_1.5         DT_0.4               trinotateR_1.0      
## [16] data.table_1.11.4    gridExtra_2.3        ggrepel_0.8.0       
## [19] scales_0.5.0         ggplot2_3.0.0        bindrcpp_0.2.2      
## [22] superheat_1.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137      bit64_0.9-7       ggsci_2.9        
##  [4] rprojroot_1.3-2   tools_3.5.0       backports_1.1.2  
##  [7] vegan_2.5-2       R6_2.2.2          mgcv_1.8-23      
## [10] DBI_1.0.0         lazyeval_0.2.1    colorspace_1.3-2 
## [13] permute_0.9-4     ade4_1.7-11       withr_2.1.2      
## [16] tidyselect_0.2.4  bit_1.1-14        compiler_3.5.0   
## [19] cli_1.0.0         ggdendro_0.1-20   labeling_0.3     
## [22] stringr_1.3.1     digest_0.6.15     rmarkdown_1.10   
## [25] XVector_0.20.0    pkgconfig_2.0.1   htmltools_0.3.6  
## [28] GA_3.1.1          highr_0.7         htmlwidgets_1.2  
## [31] rlang_0.2.1       RSQLite_2.1.1     shiny_1.1.0      
## [34] bindr_0.1.1       jsonlite_1.5      crosstalk_1.0.0  
## [37] GOSemSim_2.6.0    dplyr_0.7.6       GO.db_3.6.0      
## [40] Matrix_1.2-14     biomformat_1.8.0  Rhdf5lib_1.2.1   
## [43] Rcpp_0.12.18      munsell_0.5.0     ggfittext_0.6.0  
## [46] stringi_1.2.4     yaml_2.1.19       MASS_7.3-49      
## [49] zlibbioc_1.26.0   rhdf5_2.24.0      plyr_1.8.4       
## [52] blob_1.1.1        promises_1.0.1    crayon_1.3.4     
## [55] lattice_0.20-35   splines_3.5.0     Biostrings_2.48.0
## [58] multtest_2.36.0   knitr_1.20        pillar_1.2.3     
## [61] igraph_1.2.2      reshape2_1.4.3    codetools_0.2-15 
## [64] glue_1.2.0        evaluate_0.10.1   treeio_1.4.1     
## [67] httpuv_1.4.4.1    foreach_1.4.4     gtable_0.2.0     
## [70] purrr_0.2.5       tidyr_0.8.1       assertthat_0.2.0 
## [73] mime_0.5          xtable_1.8-2      tidytree_0.1.9   
## [76] later_0.7.3       survival_2.41-3   tibble_1.4.2     
## [79] iterators_1.0.10  rvcheck_0.1.0     memoise_1.1.0    
## [82] cluster_2.0.7-1